The name of the course is Introduction to Open Data Science and we are focusing on R, RStudio, GitHub and Markdown. You can find my GitHub repository here.
After the data wrangling exercise the new data set can be found in the data folder. The set is based on data collected from the course Introduction to Social Statistics, fall 2014 (in Finnish). The survey was conducted 3.12.2014 - 10.1.2015 by Kimmo Vehkalahti.
student2014 <- read.table("data/learning2014.txt", header = TRUE)
dim(student2014)
## [1] 166 7
The student data includes 7 variables and 166 rows.
str(student2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
Variables deep, stra and surf are combination variables derived from the original survey data.
summary(student2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Of the 166 survey respondents, 56 were men and 110 were women. The mean age was 25.5 years; the oldest respondent was 55 and the youngest 17 years old.
The variables differ between genders: in particular, the distributions of age, attitude and surf (surface approach) look different. The three highest correlations between variables can be seen in the plot matrix.
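The distributions and correlations mentioned above can be drawn, for example, with the GGally package. This is a sketch, assuming GGally and ggplot2 are installed:

```r
library(ggplot2)
library(GGally)

# plot matrix of all variables, coloured by gender
p <- ggpairs(student2014, mapping = aes(col = gender, alpha = 0.3),
             lower = list(combo = wrap("facethist", bins = 20)))
p
```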
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = student2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In this linear regression model points is the target variable, and attitude, strategic approach (stra) and surface approach (surf) are the explanatory variables.
The residuals of the model range from about -17.2 to 10.9 with a median of 0.52. I assume that the errors are normally distributed, but the distribution needs confirmation.
Attitude is the only variable with high significance in this model. The p-values of stra and surf are far too high for even slight significance.
The estimated coefficient of attitude is ~0.34 and its standard error is clearly smaller (~0.057). The other explanatory variables are not significant in this model.
The residual standard error is high in relation to the first and third quartiles of the residuals.
This linear regression model with three explanatory variables explains ~19.3% of the variation in points (adjusted R-squared).
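The simpler model summarized below keeps only the clearly significant explanatory variable. A sketch of the refit; the object name my_model2 is my own choice:

```r
# refit the model using attitude as the only explanatory variable
my_model2 <- lm(points ~ attitude, data = student2014)
summary(my_model2)
```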
##
## Call:
## lm(formula = points ~ attitude, data = student2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The estimated coefficient of the explanatory variable attitude is ~0.35. This means that when attitude rises by one point, the target variable (points) grows by 0.35 on average.
The p-value of attitude (4.12e-09) shows that the variable is highly significant in this linear regression model.
This model explains ~18.6% of the variation in points (adjusted R-squared).
Residuals vs Fitted: This plot shows no pattern in the residuals, and the variance of the errors looks constant. One can conclude that the assumption of constant error variance is valid.
Normal Q-Q: The normal Q-Q plot shows that the errors of the model are approximately normally distributed.
Residuals vs Leverage: The last plot of the graphical model validation shows that the impact of any single observation is moderate. The model includes some outliers, but the leverage of a single observation does not compromise the validity of the model.
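The diagnostic plots discussed above can be produced directly from the fitted model object (assuming it is stored, for example, as my_model2):

```r
# 1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(my_model2, which = c(1, 2, 5))
```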
Using Data Mining to Predict Secondary School Student Alcohol Consumption. Fabio Pagnotta, Hossain Mohammad Amran. Department of Computer Science, University of Camerino.
https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION
The data combines the Math course and the Portuguese language course datasets into one. After the data wrangling exercise the new data set can be found in the data folder.
alc <- read.table("data/student_alc.txt", header = TRUE)
dim(alc)
## [1] 382 35
The student data includes 35 variables and 382 rows.
In this analysis I’m going to study the relationship between high/low alcohol consumption, sex, and the following variables:
| Variable | Type | Description |
|---|---|---|
| age | numeric | student’s age |
| studytime | numeric, scale 1-4 | weekly study time |
| freetime | numeric, scale 1-5 | free time after school |
| absences | numeric | number of school absences |
First these relationships are examined with tables and graphics. The hypotheses are as follows:
Age
The age profile of high alcohol consumption may differ between the sexes. Character development differs among young people, and this may affect alcohol consumption habits.
H0: Age does not affect alcohol consumption
H1: The level of alcohol consumption differs by age
#a jitterplot of high_use, sex and age
g1 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g1 + geom_jitter() + ggtitle("Age by alcohol consumption and sex")
It seems that sex and age are randomly distributed in both alcohol consumption groups.
#a boxplot of high_use, sex and age
g1 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g1 + geom_boxplot() + ggtitle("Age by alcohol consumption and sex") + xlab("High consumption group") + ylab("Age of student")
The group averages, however, show that male students have a lower average age in the low consumption group, and female students in the high consumption group.
Study time
High alcohol consumption may be related to the time spent on studies, because one can’t do both at the same time, at least not successfully.
H0: Alcohol consumption does not affect weekly study time
H1: Alcohol consumption affects weekly study time (numeric: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, 4 = >10 hours)
#barplot about study time
ggplot(alc, aes(studytime, fill = high_use)) + geom_bar(position = "fill") +
ggtitle("Barplot about study time grouped by high_use") + xlab("Study time") + ylab("Proportion of students")
It seems that there are fewer high users among the students who spend more time studying (3-4) than among those who spend less time studying (1-2).
Free time
High alcohol consumption may be related to free time, so that students who have more free time consume more alcohol than students who have less free time.
H0: The amount of free time does not affect alcohol consumption among students
H1: The amount of free time affects alcohol consumption among students
#barplot about free time
ggplot(alc, aes(freetime, fill = high_use)) + geom_bar(position = "fill") +
ggtitle("Barplot about free time grouped by high_use") + xlab("Free time") + ylab("Proportion of students")
It seems that there are more students consuming high amounts of alcohol among those who have more free time than among those who have less.
Absences
High consumption of alcohol may cause absences.
H0: High consumption of alcohol does not affect absences
H1: High consumption of alcohol does affect absences
#a boxplot of high_use and absences
g1 <- ggplot(alc, aes(x = high_use, y = absences, fill = high_use))
g1 + geom_boxplot() + ggtitle("Absences by alcohol consumption") + xlab("High consumption group") + ylab("Absences")
It seems that the differences in absences between the alcohol consumption groups are small. Absences are on average higher in the high consumption group, but the difference may not be significant.
#the model with glm()
m <- glm(high_use ~ age + studytime + freetime + absences, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + studytime + freetime + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9649 -0.8267 -0.6238 1.0990 2.2861
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.39240 1.79306 -2.450 0.014299 *
## age 0.18543 0.10313 1.798 0.072183 .
## studytime -0.51842 0.15867 -3.267 0.001086 **
## freetime 0.33015 0.12430 2.656 0.007907 **
## absences 0.07856 0.02269 3.463 0.000535 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.18 on 377 degrees of freedom
## AIC: 433.18
##
## Number of Fisher Scoring iterations: 4
From the fitted model one can see that all explanatory variables except age are statistically significant with p-value < 0.01. The variable absences is statistically significant even with p-value < 0.001. It seems that age doesn’t explain whether or not a student is a high user of alcohol.
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01237102 0.0003481843 0.3998333
## age 1.20373817 0.9852330457 1.4775972
## studytime 0.59545916 0.4321414645 0.8061535
## freetime 1.39118359 1.0938389338 1.7826010
## absences 1.08172440 1.0368164005 1.1336048
The odds ratios (OR) and their 95% confidence intervals (CI) show that students with a low study time are roughly 1.7 times (1/0.595) as likely to be high users of alcohol as students with a higher study time. Students who have more free time are also more likely to be high users, and absences are positively associated with high use of alcohol. The confidence interval of age contains 1, so it is not statistically significant; the other variables are.
Predictive power of the final logistic regression model is calculated without the statistically insignificant variable age.
#the model with glm() and without the age variable
m_final <- glm(high_use ~ studytime + freetime + absences, data = alc, family = "binomial")
summary(m_final)
##
## Call:
## glm(formula = high_use ~ studytime + freetime + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9420 -0.8332 -0.6450 1.1266 2.1537
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.34105 0.56337 -2.380 0.017293 *
## studytime -0.50496 0.15691 -3.218 0.001290 **
## freetime 0.32626 0.12379 2.636 0.008401 **
## absences 0.08324 0.02262 3.680 0.000233 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 426.47 on 378 degrees of freedom
## AIC: 434.47
##
## Number of Fisher Scoring iterations: 4
#predict and add the answer and the prediction to the data (alc)
probabilities <- predict(m_final, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities > 0.5)
#tabulate the target variable versus the prediction
table("High use" = alc$high_use, "Prediction" = alc$prediction)
## Prediction
## High use FALSE TRUE
## FALSE 254 14
## TRUE 93 21
The table shows that the model predicts 254 true negatives, 21 true positives, 14 false positives and 93 false negatives. This kind of table is often called a confusion matrix.
table("High use" = alc$high_use, "Prediction" = alc$prediction) %>% prop.table() %>% addmargins()
## Prediction
## High use FALSE TRUE Sum
## FALSE 0.66492147 0.03664921 0.70157068
## TRUE 0.24345550 0.05497382 0.29842932
## Sum 0.90837696 0.09162304 1.00000000
The proportion table shows that 90.8% of the students are predicted to be non-high users, but only 66.5 percentage points of those are correct predictions. 9.2% are predicted to be high users, of which 5.5 percentage points really are students with high use of alcohol.
#a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
#defining a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
#calling loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2801047
The average proportion of wrong predictions in the training data is about 28%.
#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m_final, K = 10)
#average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2853403
10-fold cross-validation gives a good estimate of the model’s actual predictive power; the lower the error, the better.
In this exercise we use the Boston data from the MASS library. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. The data includes 14 variables and 506 rows.
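The dataset ships with the MASS package, and the dimensions and structure shown below can be obtained with calls like these:

```r
# access the Boston housing data from MASS
library(MASS)
data("Boston")
dim(Boston)
str(Boston)
```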
## [1] 506 14
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
| variable | description |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000 |
There are some very interesting variable distributions in the plot matrix. The variable rad has both high and low values, so its values concentrate on either side of the plot.
The plotted correlation matrix shows some high correlations between variables:
Correlation is quite clear between industrial areas (indus) and nitrogen oxides (nox): industry adds pollution to the area. Industry also seems to correlate with variables like age, dis, rad and tax.
The correlations of nitrogen oxides (nox) are very similar to those of industry (indus).
Crime rate (crim) seems to correlate with good accessibility to radial highways (rad) and the property-tax rate (tax).
Old houses (age) and distance to employment centres (dis) also have something in common.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
All the variables are numerical, so we can use the scale() function to scale the whole data set.
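A sketch of the scaling step; scale() centers each column to mean 0, divides by the standard deviation, and returns a matrix:

```r
# standardize all columns of the Boston data
boston_scaled <- scale(Boston)
summary(boston_scaled)
class(boston_scaled)  # scale() returns a matrix
```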
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
## [1] "matrix"
Scaling puts the variables on a comparable range. Before scaling, variables like black and tax were a hundredfold larger than some other variables.
The variable crim is the basis of the new categorical variable crime.
| categories | quantile points |
|---|---|
| low | 0%-25% |
| med_low | 25%-50% |
| med_high | 50%-75% |
| high | 75%-100% |
Quantile points of the variable crim
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
## crime
## low med_low med_high high
## 127 126 126 127
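One way to produce the categorical crime variable shown above is with quantile() and cut(). A sketch, assuming the scaled data is first converted to a data frame:

```r
# convert the scaled matrix to a data frame
boston_scaled <- as.data.frame(boston_scaled)

# use the quantiles of scaled crim as cut points
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)

# replace the continuous crim with the categorical crime
boston_scaled <- data.frame(subset(boston_scaled, select = -crim), crime)
```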
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
The training set contains 80% of the data; the remaining 20% form the test set.
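The split can be done by sampling 80% of the row indices; the printed vector below shows one such random sample:

```r
# randomly choose 80% of the rows for the training set
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
```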
## [1] 36 352 464 226 204 52 54 486 382 128 463 51 479 68 282 444 422
## [18] 219 398 85 12 262 182 415 23 117 414 109 386 180 389 429 146 162
## [35] 448 375 43 296 88 401 208 232 354 301 108 175 59 293 95 1 124
## [52] 284 221 306 343 84 437 435 78 170 222 360 46 276 116 439 467 404
## [69] 447 138 329 216 25 380 469 265 254 247 450 391 393 287 73 367 91
## [86] 71 241 157 135 337 270 485 452 19 466 394 97 57 473 41 333 236
## [103] 489 364 79 122 21 312 264 129 61 229 193 477 258 56 442 315 365
## [120] 476 267 140 460 480 7 42 491 327 285 372 165 159 408 268 172 328
## [137] 235 105 313 160 70 269 323 492 345 212 385 188 190 349 311 445 482
## [154] 143 125 371 257 399 443 331 197 320 161 374 256 104 321 295 238 261
## [171] 183 275 154 426 465 213 294 173 289 111 130 206 307 440 16 490 126
## [188] 211 432 89 413 28 425 493 3 6 355 406 351 20 30 470 185 335
## [205] 106 33 252 377 376 395 83 123 225 72 409 319 142 381 407 53 17
## [222] 361 245 340 356 266 151 220 45 187 244 139 290 205 131 102 430 455
## [239] 277 309 471 114 484 500 112 58 178 246 358 304 271 38 93 502 483
## [256] 24 4 417 230 207 303 115 370 167 224 10 411 22 189 251 273 69
## [273] 118 14 297 424 37 263 368 499 64 144 387 272 113 147 201 100 302
## [290] 242 478 362 325 9 148 26 324 451 410 363 96 458 421 192 300 92
## [307] 136 150 383 63 227 191 344 132 169 357 50 199 39 416 379 134 164
## [324] 231 152 336 209 153 248 98 35 346 494 283 481 223 99 278 215 214
## [341] 474 388 339 498 431 281 196 171 259 496 133 403 80 462 348 240 233
## [358] 237 456 107 101 127 506 34 318 310 433 103 353 47 176 228 332 292
## [375] 441 428 420 503 253 255 65 2 419 94 15 378 330 373 453 29 501
## [392] 366 13 181 86 497 384 184 243 55 75 316 8 119
First a linear discriminant analysis (LDA) is fitted on the training set. The new categorical variable crime is the target variable and all the other variables of the dataset are predictors.
After fitting, we draw the LDA biplot with arrows.
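A sketch of the fitting and plotting step. The lda.arrows helper is a commonly used convenience function, not part of base R:

```r
library(MASS)

# fit LDA with crime as the target and all other variables as predictors
lda.fit <- lda(crime ~ ., data = train)

# helper that draws the discriminant coefficients as arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}

# biplot of the first two discriminants, coloured by class
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
```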
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2301980 0.2623762 0.2599010 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 0.94891336 -0.9128036 -0.10299142 -0.8901527 0.4696114
## med_low -0.09446944 -0.2554719 -0.01233188 -0.5270421 -0.1686047
## med_high -0.37126774 0.1420144 0.14012905 0.3417412 0.1534452
## high -0.48724019 1.0171519 -0.03610305 1.0705681 -0.4481593
## age dis rad tax ptratio
## low -0.8955330 0.9199074 -0.6867275 -0.7298131 -0.43143267
## med_low -0.3063362 0.2848995 -0.5506543 -0.4426626 -0.05048958
## med_high 0.3523903 -0.3520024 -0.3846684 -0.3122614 -0.26232279
## high 0.8337495 -0.8768238 1.6377820 1.5138081 0.78037363
## black lstat medv
## low 0.3790972 -0.79146223 0.548967812
## med_low 0.3159215 -0.11446518 -0.001002644
## med_high 0.1150617 -0.04605376 0.201362903
## high -0.7436095 0.90589288 -0.685955738
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08193593 0.792960457 -0.80067859
## indus 0.01521157 -0.274789494 0.22967401
## chas -0.07739488 -0.058707286 0.07994890
## nox 0.45854739 -0.802198502 -1.42833105
## rm -0.09347682 -0.120166506 -0.28344817
## age 0.25864771 -0.259484013 -0.05268289
## dis -0.05534463 -0.313738490 0.01317947
## rad 3.11200441 0.783379240 -0.20514748
## tax -0.16232987 0.196824104 0.70418202
## ptratio 0.13054708 -0.003505255 -0.27355465
## black -0.13527164 0.016558718 0.12612558
## lstat 0.21510741 -0.232466710 0.39309491
## medv 0.18256868 -0.387162037 -0.11641098
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9480 0.0388 0.0132
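The predicted classes and the cross-tabulation below come from a step like this; correct_classes is assumed to hold the true crime categories of the test set:

```r
# save the correct classes, remove them from the test data, then predict
correct_classes <- test$crime
test <- subset(test, select = -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
```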
## [1] 1 1 4 3 1 1 1 3 4 3 4 2 4 1 1 4 4 2 4 1 2 3 1 4 3 2 4 2 4 1 4 4 3 3 4
## [36] 4 2 2 1 4 2 3 1 1 2 2 2 1 1 1 2 1 3 1 1 1 4 4 2 3 3 4 2 2 2 4 4 4 4 3
## [71] 1 2 3 4 4 3 3 3 4 4 4 1 2 4 1 2 2 3 3 1 2 3 4 3 3 4 2 1 3 1 1 3 2 4 1
## [106] 1 3 3 3 3 2 3 2 4 3 1 4 3 3 4 3 3 4 4 2 2 2 3 1 4 3 3 4 3 3 2 3 2 3 3
## [141] 2 3 3 2 1 3 4 1 2 1 3 4 4 3 2 4 1 4 4 1 1 3 3 4 1 2 2 1 3 3 2 1 3 4 4
## [176] 2 2 2 1 2 3 2 1 4 3 2 2 2 4 1 4 3 4 2 1 1 1 4 1 3 3 4 2 1 2 3 2 4 4 4
## [211] 1 2 3 2 4 3 3 4 4 1 3 4 2 1 2 3 3 2 2 1 2 2 1 1 3 2 4 4 2 3 4 2 3 2 2
## [246] 1 1 2 4 2 3 1 1 1 4 3 1 4 3 2 2 2 4 3 3 2 4 3 2 2 2 2 2 3 1 4 2 3 4 2
## [281] 2 4 4 2 2 3 1 1 1 2 4 4 3 2 3 3 3 4 4 4 2 4 4 1 1 1 3 3 4 2 3 2 1 3 3
## [316] 4 2 1 2 4 4 3 3 3 3 1 2 3 2 2 3 1 2 1 4 3 1 1 3 2 4 4 1 3 4 1 1 3 3 2
## [351] 3 4 2 4 1 2 3 3 4 2 2 3 1 3 2 3 4 2 1 2 1 3 1 1 4 4 4 1 2 1 1 1 4 1 3
## [386] 4 1 4 4 3 2 4 2 1 1 3 4 2 2 1 1 2 2 2
## predicted
## correct low med_low med_high high
## low 20 11 3 0
## med_low 3 16 1 0
## med_high 0 6 15 0
## high 0 0 0 27
The predictions were quite good. There were some errors in the middle of the range, but the classes low and especially high were predicted well; all 27 observations of the high class were classified correctly.
I’m going to calculate the optimal number of clusters for the Boston data. First I reload and scale the data; the variables need to be scaled to get comparable distances between observations.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Next I calculate the distances between observations and determine the number of clusters.
One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. WCSS was calculated for 1 to 15 clusters. The optimal number of clusters is where the total WCSS drops radically. It seems that in this case the optimal number is two. However, we are here to learn, so I decided to analyse a model with four clusters.
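A sketch of the WCSS calculation; k-means is stochastic, so a seed is set for reproducibility:

```r
set.seed(123)

# total within-cluster sum of squares for 1..15 clusters
k_max <- 15
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k)$tot.withinss)
plot(1:k_max, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```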
After determining the number of clusters I run the K-means algorithm again.
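Rerunning with the chosen number of clusters and plotting the variables pairwise, coloured by cluster:

```r
# k-means with four clusters
km <- kmeans(boston_scaled, centers = 4)
pairs(boston_scaled, col = km$cluster)
```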
It seems that when the data is divided into four clusters, there are clear differences in the distributions of several variables. crim, zn, indus and black are variables where one can distinguish clear patterns between the clusters. Some variables (rad & tax) show that one or two clusters can form a clear distribution, while the observations of the other clusters remain ambiguous, with no clear pattern to be recognised.
After loading the Boston dataset I scale it to get comparable distances.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv clust
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063 Min. :1.000
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989 1st Qu.:2.000
## Median : 0.3808 Median :-0.1811 Median :-0.1449 Median :3.000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean :2.674
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683 3rd Qu.:3.000
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865 Max. :4.000
The original Boston dataset is now scaled and the result of the K-means clustering is saved in the variable clust.
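A sketch of that step; the object names scaled_Boston and clust are chosen to match the LDA call in the output:

```r
library(MASS)

# scale the data, cluster it, and attach the cluster labels
boston_std <- scale(Boston)
km <- kmeans(boston_std, centers = 4)
scaled_Boston <- data.frame(boston_std, clust = km$cluster)
```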
Next the LDA is performed and the biplot with arrows is created.
## Call:
## lda(clust ~ ., data = scaled_Boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2114625 0.1304348 0.4308300 0.2272727
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3912182 1.2671159 -0.8754697 0.5739635 -0.7359091 0.9938426
## 2 1.4330759 -0.4872402 1.0689719 0.4435073 1.3439101 -0.7461469
## 3 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## 4 0.2797949 -0.4872402 1.1892663 -0.2723291 0.8998296 -0.2770011
## age dis rad tax ptratio black
## 1 -0.6949417 0.7751031 -0.5965444 -0.6369476 -0.96586616 0.34190729
## 2 0.8575386 -0.9620552 1.2941816 1.2970210 0.42015742 -1.65562038
## 3 -0.3256000 0.3182404 -0.5741127 -0.6240070 0.02986213 0.34248644
## 4 0.7716696 -0.7723199 0.9006160 1.0311612 0.60093343 -0.01717546
## lstat medv
## 1 -0.8200275 1.11919598
## 2 1.1930953 -0.81904111
## 3 -0.2813666 -0.01314324
## 4 0.6116223 -0.54636549
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.18113078 -0.5012256 -0.60535205
## zn 0.43297497 -1.0486194 0.67406151
## indus 1.37753200 0.3016928 1.07034034
## chas -0.04307937 -0.7598229 -0.22448239
## nox 1.04674638 -0.3861005 -0.33268952
## rm -0.14912869 -0.1510367 0.67942589
## age -0.09897424 0.0523110 0.26285587
## dis 0.13139210 -0.1593367 -0.03487882
## rad 0.65824136 0.5189795 0.48145070
## tax 0.28903561 -0.5773959 0.10350513
## ptratio 0.22236843 0.1668597 -0.09181715
## black -0.42730704 0.5843973 0.89869354
## lstat 0.24320629 -0.6197780 -0.01119242
## medv 0.21961575 -0.9485829 -0.17065360
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7596 0.1768 0.0636
The biplot shows that the variables indus, zn and medv are the most influential separators of the clusters.
The colors of both plots are based on four classes. The K-means plot appears to show the clusters more clearly than the plot based on the crime classification.
The data for this exercise originates from the United Nations Development Programme. The Human Development Index (HDI) was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone.
The original data can be found at: http://hdr.undp.org/en/content/human-development-index-hdi
The modified dataset includes 155 observations and 9 variables (country is stored as the row name):
| variable | description |
|---|---|
| country | Country name (name of the row) |
| gni_capita | Gross national income per capita |
| life_exp | Life expectancy at birth |
| edu_years | Expected years of schooling |
| mortality | Maternal mortality ratio |
| young_mom | Adolescent birth rate |
| women_parlament | Percentage of female representatives in parliament |
| edu_ratio | Ratio of females to males with at least secondary education |
| labour_ratio | Ratio of females to males in the labour force |
## 'data.frame': 155 obs. of 8 variables:
## $ gni_capita : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ life_exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ edu_years : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ mortality : int 4 6 6 5 6 7 9 28 11 8 ...
## $ young_mom : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ women_parlament: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## $ edu_ratio : num 1.007 0.997 0.983 0.989 0.969 ...
## $ labour_ratio : num 0.891 0.819 0.825 0.884 0.829 ...
All variables are numeric or integer. Two of them (edu_ratio, labour_ratio) are ratios.
## gni_capita life_exp edu_years mortality
## Min. : 581 Min. :49.00 Min. : 5.40 Min. : 1.0
## 1st Qu.: 4198 1st Qu.:66.30 1st Qu.:11.25 1st Qu.: 11.5
## Median : 12040 Median :74.20 Median :13.50 Median : 49.0
## Mean : 17628 Mean :71.65 Mean :13.18 Mean : 149.1
## 3rd Qu.: 24512 3rd Qu.:77.25 3rd Qu.:15.20 3rd Qu.: 190.0
## Max. :123124 Max. :83.50 Max. :20.20 Max. :1100.0
## young_mom women_parlament edu_ratio labour_ratio
## Min. : 0.60 Min. : 0.00 Min. :0.1717 Min. :0.1857
## 1st Qu.: 12.65 1st Qu.:12.40 1st Qu.:0.7264 1st Qu.:0.5984
## Median : 33.60 Median :19.30 Median :0.9375 Median :0.7535
## Mean : 47.16 Mean :20.91 Mean :0.8529 Mean :0.7074
## 3rd Qu.: 71.95 3rd Qu.:27.95 3rd Qu.:0.9968 3rd Qu.:0.8535
## Max. :204.80 Max. :57.50 Max. :1.4967 Max. :1.0380
There is great variation between the numeric variables. Gross national income per capita (gni_capita) ranges from $581 (min) to $123,124 (max), while the two ratio variables (edu_ratio, labour_ratio) vary on both sides of 1. Life expectancy (life_exp) and expected years of schooling (edu_years) are measured in years, with maxima well under 100. The percentage of female representatives in parliament (women_parlament) is the only variable expressed as a percentage.
The maternal mortality ratio (mortality) correlates strongly and negatively with life expectancy, expected years of schooling, and the education ratio between females and males. Maternal mortality and the adolescent birth rate also appear to be connected. One could conclude that in some countries uneducated females give birth at a very young age, which leads to a high maternal mortality rate.
Gross national income per capita appears to be connected to life expectancy and expected years of schooling. Perhaps an educated labour force helps a society to prosper economically.
The share of women in parliament seems to correlate with the education ratio between females and males. It seems that gender equality starts from education and enables both genders to participate in social decision-making.
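The correlations discussed above can be inspected with a correlation matrix and, for example, the corrplot package. This is a sketch; the data frame name human is an assumption:

```r
library(corrplot)

# Correlation matrix of the human data, rounded for readability
cor_matrix <- round(cor(human), digits = 2)
cor_matrix

# Visualize the correlations (upper triangle, circle size = strength)
corrplot(cor_matrix, method = "circle", type = "upper")
```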
First we conduct PCA on the unstandardized data.
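A minimal sketch of this step, assuming the data is stored in a data frame called human (the plotting arguments are assumptions):

```r
# PCA on the unstandardized data using SVD-based prcomp()
pca_human <- prcomp(human)
pca_human
summary(pca_human)

# Biplot of the first two principal components
biplot(pca_human, choices = 1:2, cex = c(0.8, 1),
       col = c("grey40", "deeppink2"))
```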
## Standard deviations:
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation:
## PC1 PC2 PC3 PC4
## gni_capita -9.999832e-01 -0.0057723054 5.156742e-04 -4.932889e-05
## life_exp -2.815823e-04 0.0283150248 -1.294971e-02 6.752684e-02
## edu_years -9.562910e-05 0.0075529759 -1.427664e-02 3.313505e-02
## mortality 5.655734e-03 -0.9916320120 -1.260302e-01 6.100534e-03
## young_mom 1.233961e-03 -0.1255502723 9.918113e-01 -5.301595e-03
## women_parlament -5.526460e-05 0.0032317269 7.398331e-03 9.971232e-01
## edu_ratio -5.607472e-06 0.0006713951 3.412027e-05 2.736326e-04
## labour_ratio 2.331945e-07 -0.0002819357 -5.302884e-04 4.692578e-03
## PC5 PC6 PC7 PC8
## gni_capita -0.0001135863 2.711698e-05 8.075191e-07 -1.176762e-06
## life_exp 0.9865644425 1.453515e-01 -5.380452e-03 2.281723e-03
## edu_years 0.1431180282 -9.882477e-01 3.826887e-02 7.776451e-03
## mortality 0.0266373214 -1.695203e-03 -1.355518e-04 8.371934e-04
## young_mom 0.0188618600 -1.273198e-02 8.641234e-05 -1.707885e-04
## women_parlament -0.0716401914 2.309896e-02 2.642548e-03 2.680113e-03
## edu_ratio -0.0022935252 -2.180183e-02 -6.998623e-01 7.139410e-01
## labour_ratio 0.0022190154 -3.264423e-02 -7.132267e-01 -7.001533e-01
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
The plot doesn’t look good: the large variance of the gross national income per capita variable makes it dominate the principal component analysis. The data needs to be standardized.
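Standardizing and repeating the PCA can be sketched as follows (again assuming the data frame is called human):

```r
# Standardize: subtract the column mean and divide by the standard deviation
human_std <- scale(human)
summary(human_std)

# PCA on the standardized data
pca_human_std <- prcomp(human_std)
summary(pca_human_std)

# Biplot of the first two principal components
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1),
       col = c("grey40", "deeppink2"))
```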
## gni_capita life_exp edu_years mortality
## Min. :-0.9193 Min. :-2.7188 Min. :-2.7378 Min. :-0.6992
## 1st Qu.:-0.7243 1st Qu.:-0.6425 1st Qu.:-0.6782 1st Qu.:-0.6496
## Median :-0.3013 Median : 0.3056 Median : 0.1140 Median :-0.4726
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.6717 3rd Qu.: 0.7126 3rd Qu.: 0.1932
## Max. : 5.6890 Max. : 1.4218 Max. : 2.4730 Max. : 4.4899
## young_mom women_parlament edu_ratio labour_ratio
## Min. :-1.1325 Min. :-1.8203 Min. :-2.8189 Min. :-2.6247
## 1st Qu.:-0.8394 1st Qu.:-0.7409 1st Qu.:-0.5233 1st Qu.:-0.5484
## Median :-0.3298 Median :-0.1403 Median : 0.3503 Median : 0.2316
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6030 3rd Qu.: 0.6127 3rd Qu.: 0.5958 3rd Qu.: 0.7350
## Max. : 3.8344 Max. : 3.1850 Max. : 2.6646 Max. : 1.6632
## Standard deviations:
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation:
## PC1 PC2 PC3 PC4 PC5
## gni_capita -0.35048295 -0.05060876 0.20168779 -0.72727675 0.4950306
## life_exp -0.44372240 0.02530473 -0.10991305 -0.05834819 -0.1628935
## edu_years -0.42766720 -0.13940571 0.07340270 -0.07020294 -0.1659678
## mortality 0.43697098 -0.14508727 0.12522539 -0.25170614 0.1800657
## young_mom 0.41126010 -0.07708468 -0.01968243 0.04986763 0.4672068
## women_parlament -0.08438558 -0.65136866 -0.72506309 0.01396293 0.1523699
## edu_ratio -0.35664370 -0.03796058 0.24223089 0.62678110 0.5983585
## labour_ratio 0.05457785 -0.72432726 0.58428770 0.06199424 -0.2625067
## PC6 PC7 PC8
## gni_capita -0.11120305 -0.13711838 -0.16961173
## life_exp 0.42242796 -0.43406432 0.62737008
## edu_years 0.38606919 0.77962966 -0.05415984
## mortality -0.17370039 0.35380306 0.72193946
## young_mom 0.76056557 -0.06897064 -0.14335186
## women_parlament -0.13749772 0.00568387 -0.02306476
## edu_ratio -0.17713316 0.05773644 0.16459453
## labour_ratio 0.03500707 -0.22729927 -0.07304568
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
Biplots of the unstandardized and standardized data differ in many ways. The large variance of the gni_capita variable in the unstandardized data distorts the analysis and the biplot: before standardization, the first component explains almost 100 % of the variance. When the data is scaled properly, each variable has a reasonable impact, and after standardization the first component explains ~54 %. The biplot also looks better when the mean of the observations is in the middle of the plot.
The variables mortality and young_mom are positively correlated with the first principal component (PC1). Gni_capita, life_exp, edu_years and edu_ratio also correlate with PC1, but negatively; in the biplot, the arrows of the negatively correlating variables point against the PC1 axis. The variables women_parlament and labour_ratio are not significant for PC1. The standard deviations of the standardized variables are similar, and the first principal component explains ~53.6 % of the variance.
The second principal component explains ~16.2 % of the variance; its important variables are women_parlament and labour_ratio, both of which correlate negatively with PC2.
In this analysis we use the tea dataset from the FactoMineR package.
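Loading and inspecting the dataset, which ships with the package, might look like this:

```r
library(FactoMineR)

# Load the tea dataset bundled with FactoMineR
data("tea")
dim(tea)
str(tea)
```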
## [1] 300 36
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
The tea dataset includes 300 observations and 36 variables. For the analysis we use only certain columns of the data.
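A sketch of the column selection, using the ten variables shown in the structure listing below (the use of dplyr here is an assumption; the object name tea_attitude matches the MCA call):

```r
library(dplyr)

# Attitude and background variables kept for the analysis
keep_columns <- c("home", "work", "friends", "How", "sex",
                  "healthy", "exciting", "relaxing",
                  "sophisticated", "spirituality")

# Keep only those columns
tea_attitude <- select(tea, one_of(keep_columns))
summary(tea_attitude)
str(tea_attitude)
```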
## home work friends How sex
## home :291 Not.work:213 friends :196 alone:195 F:178
## Not.home: 9 work : 87 Not.friends:104 lemon: 33 M:122
## milk : 63
## other: 9
## healthy exciting relaxing
## healthy :210 exciting :116 No.relaxing:113
## Not.healthy: 90 No.exciting:184 relaxing :187
##
##
## sophisticated spirituality
## Not.sophisticated: 85 Not.spirituality:206
## sophisticated :215 spirituality : 94
##
##
## 'data.frame': 300 obs. of 10 variables:
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ sophisticated: Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
Next we perform a multiple correspondence analysis (MCA) for these ten variables and visualize it.
mca <- MCA(tea_attitude, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_attitude, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.147 0.134 0.118 0.115 0.106 0.102
## % of var. 12.236 11.141 9.809 9.620 8.803 8.519
## Cumulative % of var. 12.236 23.377 33.186 42.806 51.609 60.129
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.096 0.091 0.081 0.079 0.073 0.058
## % of var. 8.037 7.588 6.709 6.568 6.102 4.867
## Cumulative % of var. 68.165 75.754 82.463 89.031 95.133 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.401 0.365 0.161 | -0.666 1.106 0.443 | 0.161 0.073
## 2 | 0.569 0.736 0.241 | -0.254 0.161 0.048 | 0.461 0.601
## 3 | -0.427 0.414 0.205 | -0.185 0.085 0.038 | 0.218 0.134
## 4 | -0.180 0.074 0.038 | -0.178 0.079 0.037 | -0.270 0.206
## 5 | 0.149 0.050 0.018 | -0.454 0.514 0.164 | -0.081 0.019
## 6 | 0.053 0.006 0.003 | -0.734 1.342 0.600 | 0.128 0.046
## 7 | -0.129 0.038 0.022 | -0.536 0.716 0.377 | 0.248 0.174
## 8 | -0.170 0.066 0.031 | -0.379 0.357 0.154 | 0.071 0.014
## 9 | 0.247 0.138 0.051 | -0.476 0.564 0.189 | 0.101 0.029
## 10 | 0.053 0.006 0.003 | -0.734 1.342 0.600 | 0.128 0.046
## cos2
## 1 0.026 |
## 2 0.158 |
## 3 0.053 |
## 4 0.085 |
## 5 0.005 |
## 6 0.018 |
## 7 0.081 |
## 8 0.005 |
## 9 0.008 |
## 10 0.018 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## home | -0.063 0.259 0.127 -6.151 | -0.034 0.085 0.038
## Not.home | 2.023 8.360 0.127 6.151 | 1.105 2.740 0.038
## Not.work | 0.114 0.630 0.032 3.089 | -0.286 4.357 0.201
## work | -0.280 1.543 0.032 -3.089 | 0.701 10.667 0.201
## friends | -0.242 2.610 0.111 -5.749 | 0.251 3.068 0.118
## Not.friends | 0.456 4.918 0.111 5.749 | -0.472 5.783 0.118
## alone | -0.013 0.008 0.000 -0.311 | -0.169 1.391 0.053
## lemon | 0.052 0.020 0.000 0.317 | 1.236 12.575 0.189
## milk | 0.287 1.178 0.022 2.559 | -0.119 0.223 0.004
## other | -1.914 7.488 0.113 -5.822 | -0.033 0.003 0.000
## v.test Dim.3 ctr cos2 v.test
## home -3.360 | 0.055 0.250 0.098 5.413 |
## Not.home 3.360 | -1.780 8.076 0.098 -5.413 |
## Not.work -7.749 | -0.148 1.324 0.054 -4.009 |
## work 7.749 | 0.363 3.242 0.054 4.009 |
## friends 5.948 | 0.143 1.142 0.039 3.405 |
## Not.friends -5.948 | -0.270 2.152 0.039 -3.405 |
## alone -3.986 | -0.545 16.412 0.552 -12.846 |
## lemon 7.515 | 1.037 10.052 0.133 6.305 |
## milk -1.063 | 1.035 19.127 0.285 9.231 |
## other -0.102 | 0.761 1.476 0.018 2.314 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## home | 0.127 0.038 0.098 |
## work | 0.032 0.201 0.054 |
## friends | 0.111 0.118 0.039 |
## How | 0.128 0.190 0.554 |
## sex | 0.135 0.021 0.091 |
## healthy | 0.152 0.001 0.055 |
## exciting | 0.283 0.319 0.001 |
## relaxing | 0.417 0.014 0.003 |
## sophisticated | 0.034 0.184 0.273 |
## spirituality | 0.051 0.251 0.009 |
plot(mca, invisible=c("ind"), habillage = "quali")
It seems that females experience tea as a drink connected to spirituality, sophistication and health. They tend to drink it with friends, and they use lemon more than males.
Males don’t drink their tea at work. They don’t consider it healthy, and they tend to drink it alone, or at least not with friends. Males use milk in their tea more than females.